1 Introduction

For high school students who excel in mathematics, the International Mathematical Olympiad (IMO) is the top international competition. First held in Romania in 1959 with seven participating nations, it has since grown to include more than 100 nations from five continents. The IMO is held annually in a different host country and presents competitors with six problems, three per day, over two days.

This dataset originates from the International Mathematical Olympiad and contains detailed information about participating countries. It includes the distribution of genders within each team, individual scores for different sections of the competition, the number of gold, silver, and bronze awards won, the count of honorable mentions received by each country, as well as the names of the team leader and deputy leader.

The goal of this study is to determine the most effective clustering technique for examining the performance of participating countries over the decades. The objective is to identify distinct patterns or groupings in the data that reflect historical and comparative performance across time periods and geographic regions, and thus to provide insight into how mathematical proficiency has evolved globally.

2 Data Preprocessing and Data Cleaning

2.1 Preprocessing data

dim(df)
## [1] 3780   18
head(df)
## # A tibble: 6 × 18
##    year country  team_size_all team_size_male team_size_female    p1    p2    p3
##   <dbl> <chr>            <dbl>          <dbl>            <dbl> <dbl> <dbl> <dbl>
## 1  2024 United …             6              5                1    42    41    19
## 2  2024 People'…             6              6                0    42    42    31
## 3  2024 Republi…             6              6                0    42    37    18
## 4  2024 India                6              6                0    42    34    11
## 5  2024 Belarus              6              6                0    42    30    10
## 6  2024 Singapo…             6              6                0    42    37     7
## # ℹ 10 more variables: p4 <dbl>, p5 <dbl>, p6 <dbl>, p7 <lgl>,
## #   awards_gold <dbl>, awards_silver <dbl>, awards_bronze <dbl>,
## #   awards_honorable_mentions <dbl>, leader <chr>, deputy_leader <chr>
summary(df)
##       year        country          team_size_all   team_size_male
##  Min.   :1959   Length:3780        Min.   :1.000   Min.   :0.0   
##  1st Qu.:1995   Class :character   1st Qu.:6.000   1st Qu.:5.0   
##  Median :2006   Mode  :character   Median :6.000   Median :6.0   
##  Mean   :2004                      Mean   :5.742   Mean   :5.2   
##  3rd Qu.:2016                      3rd Qu.:6.000   3rd Qu.:6.0   
##  Max.   :2024                      Max.   :8.000   Max.   :8.0   
##                                                    NA's   :283   
##  team_size_female       p1              p2              p3        
##  Min.   :0.000    Min.   : 0.00   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:1.000    1st Qu.:12.00   1st Qu.: 3.25   1st Qu.: 0.000  
##  Median :1.000    Median :26.00   Median :12.00   Median : 2.000  
##  Mean   :1.066    Mean   :24.74   Mean   :15.44   Mean   : 6.958  
##  3rd Qu.:1.000    3rd Qu.:38.00   3rd Qu.:26.00   3rd Qu.:10.000  
##  Max.   :6.000    Max.   :56.00   Max.   :56.00   Max.   :64.000  
##  NA's   :2180     NA's   :110     NA's   :110     NA's   :110     
##        p4              p5              p6            p7         
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.000   Mode:logical  
##  1st Qu.:10.00   1st Qu.: 2.00   1st Qu.: 0.000   NA's:3780     
##  Median :23.00   Median :10.00   Median : 1.000                 
##  Mean   :23.01   Mean   :14.09   Mean   : 5.698                 
##  3rd Qu.:36.00   3rd Qu.:23.00   3rd Qu.: 7.000                 
##  Max.   :56.00   Max.   :56.00   Max.   :63.000                 
##  NA's   :110     NA's   :110     NA's   :110                    
##   awards_gold     awards_silver    awards_bronze   awards_honorable_mentions
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.000            
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000            
##  Median :0.0000   Median :0.0000   Median :1.000   Median :1.000            
##  Mean   :0.4706   Mean   :0.9603   Mean   :1.417   Mean   :1.177            
##  3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:2.000   3rd Qu.:2.000            
##  Max.   :6.0000   Max.   :6.0000   Max.   :6.000   Max.   :6.000            
##  NA's   :2        NA's   :2        NA's   :2       NA's   :515              
##     leader          deputy_leader     
##  Length:3780        Length:3780       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

The data consist of 3780 observations and 18 variables, or 68,040 cells in total (3780 × 18).

IMO teams currently consist of at most six contestants, although the summary above shows teams of up to eight in the data. Some teams have included both male and female contestants, while others have been all-male or all-female.

unique_countries <- unique(df$country)
num_unique_countries <- length(unique_countries)
num_unique_countries
## [1] 139

In total, 139 countries across 5 continents participated in the International Mathematical Olympiad between 1959 and 2024.

2.2 Data Cleaning

First, I check for missing values.

colSums(is.na(df))
##                      year                   country             team_size_all 
##                         0                         0                         0 
##            team_size_male          team_size_female                        p1 
##                       283                      2180                       110 
##                        p2                        p3                        p4 
##                       110                       110                       110 
##                        p5                        p6                        p7 
##                       110                       110                      3780 
##               awards_gold             awards_silver             awards_bronze 
##                         2                         2                         2 
## awards_honorable_mentions                    leader             deputy_leader 
##                       515                       870                       968
# The leader and deputy_leader columns are not used in the clustering, so they are removed. The p7 column is also dropped because it is entirely empty (3780 NAs).
df_cleaned <- df[, !colnames(df) %in% c("p7", "leader", "deputy_leader")]

# Missing values in team_size_male / team_size_female are assumed to arise when a team was solely female or solely male, so they are replaced with 0
df_cleaned$team_size_male[is.na(df_cleaned$team_size_male)] <- 0
df_cleaned$team_size_female[is.na(df_cleaned$team_size_female)] <- 0

# Check for missing data in p1-p6 (the scores on each of the six problems)
df_cleaned[is.na(df_cleaned$p1) | is.na(df_cleaned$p2) | is.na(df_cleaned$p3) | is.na(df_cleaned$p4) | is.na(df_cleaned$p5) | is.na(df_cleaned$p6), ]
## # A tibble: 110 × 15
##     year country team_size_all team_size_male team_size_female    p1    p2    p3
##    <dbl> <chr>           <dbl>          <dbl>            <dbl> <dbl> <dbl> <dbl>
##  1  2010 Democr…             6              0                0    NA    NA    NA
##  2  1991 Democr…             6              0                0    NA    NA    NA
##  3  1983 Germany             6              6                0    NA    NA    NA
##  4  1983 United…             6              6                0    NA    NA    NA
##  5  1983 Hungary             6              6                0    NA    NA    NA
##  6  1983 Union …             6              6                0    NA    NA    NA
##  7  1983 Romania             6              6                0    NA    NA    NA
##  8  1983 Vietnam             6              6                0    NA    NA    NA
##  9  1983 Bulgar…             6              5                1    NA    NA    NA
## 10  1983 France              6              6                0    NA    NA    NA
## # ℹ 100 more rows
## # ℹ 7 more variables: p4 <dbl>, p5 <dbl>, p6 <dbl>, awards_gold <dbl>,
## #   awards_silver <dbl>, awards_bronze <dbl>, awards_honorable_mentions <dbl>
# 110 rows have no values for p1 to p6, so no per-problem scores are available for those country-years. They cannot contribute to the score features, so all of these rows are removed
df_cleaned <- df_cleaned[!apply(is.na(df_cleaned[, c("p1", "p2", "p3", "p4", "p5", "p6")]), 1, any), ]

# Missing values in awards_honorable_mentions are filled with 0
df_cleaned$awards_honorable_mentions[is.na(df_cleaned$awards_honorable_mentions)] <- 0
df_cleaned
## # A tibble: 3,670 × 15
##     year country team_size_all team_size_male team_size_female    p1    p2    p3
##    <dbl> <chr>           <dbl>          <dbl>            <dbl> <dbl> <dbl> <dbl>
##  1  2024 United…             6              5                1    42    41    19
##  2  2024 People…             6              6                0    42    42    31
##  3  2024 Republ…             6              6                0    42    37    18
##  4  2024 India               6              6                0    42    34    11
##  5  2024 Belarus             6              6                0    42    30    10
##  6  2024 Singap…             6              6                0    42    37     7
##  7  2024 United…             6              6                0    42    33     8
##  8  2024 Hungary             6              6                0    42    37    16
##  9  2024 Poland              6              6                0    42    25     5
## 10  2024 Türkiye             6              5                1    38    37     5
## # ℹ 3,660 more rows
## # ℹ 7 more variables: p4 <dbl>, p5 <dbl>, p6 <dbl>, awards_gold <dbl>,
## #   awards_silver <dbl>, awards_bronze <dbl>, awards_honorable_mentions <dbl>
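
Filling missing gender counts with 0 is only safe when the two counts still add up to the team size. The following is a minimal sketch of that check on a small made-up data frame; `toy`, its values and the `gender_mismatch` column are illustrative assumptions, not part of the dataset.

```r
# Toy rows mimicking the cleaned columns (values are invented for illustration)
toy <- data.frame(
  country          = c("A", "B", "C"),
  team_size_all    = c(6, 6, 6),
  team_size_male   = c(5, 6, NA),
  team_size_female = c(1, NA, NA)
)

# Same zero-filling rule as in the cleaning step
toy$team_size_male[is.na(toy$team_size_male)] <- 0
toy$team_size_female[is.na(toy$team_size_female)] <- 0

# Flag rows where the gender counts no longer sum to the team size
toy$gender_mismatch <-
  toy$team_size_male + toy$team_size_female != toy$team_size_all

toy[toy$gender_mismatch,
    c("country", "team_size_all", "team_size_male", "team_size_female")]
```

Rows flagged this way (a team of 6 with zero male and zero female members, as in the first rows of the missing-score listing above) have an unknown gender split rather than a true zero. Since the gender columns are not used as clustering features, this does not affect the clusters themselves.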

3 Feature Engineering

To reveal each country's performance by decade in the IMO, I aggregate the information into a form that can produce meaningful clusters.

For each country and year, I calculate the average score over the six problems and the total number of awards (gold, silver and bronze). These values, together with the number of honorable mentions, are then summarized by decade.

df_cleaned$total_awards <- df_cleaned$awards_gold + df_cleaned$awards_silver + df_cleaned$awards_bronze
df_cleaned$average_score <- rowMeans(df_cleaned[, c("p1", "p2", "p3", "p4", "p5", "p6")])
# Map each year to its decade (e.g. 1987 -> 1980)
df_cleaned$decade <- floor(df_cleaned$year / 10) * 10
head(df_cleaned)
## # A tibble: 6 × 18
##    year country  team_size_all team_size_male team_size_female    p1    p2    p3
##   <dbl> <chr>            <dbl>          <dbl>            <dbl> <dbl> <dbl> <dbl>
## 1  2024 United …             6              5                1    42    41    19
## 2  2024 People'…             6              6                0    42    42    31
## 3  2024 Republi…             6              6                0    42    37    18
## 4  2024 India                6              6                0    42    34    11
## 5  2024 Belarus              6              6                0    42    30    10
## 6  2024 Singapo…             6              6                0    42    37     7
## # ℹ 10 more variables: p4 <dbl>, p5 <dbl>, p6 <dbl>, awards_gold <dbl>,
## #   awards_silver <dbl>, awards_bronze <dbl>, awards_honorable_mentions <dbl>,
## #   total_awards <dbl>, average_score <dbl>, decade <dbl>
library(dplyr)  # provides %>%, group_by() and summarise()
pfm_aggregate <- df_cleaned %>% group_by(country, decade) %>%
  summarise(
    average_score = mean(average_score),
    total_awards = sum(total_awards),
    awards_honorable_mentions = sum(awards_honorable_mentions),
    .groups = "drop")
pfm_aggregate
## # A tibble: 541 × 5
##    country decade average_score total_awards awards_honorable_mentions
##    <chr>    <dbl>         <dbl>        <dbl>                     <dbl>
##  1 Albania   1990         3.42             0                         2
##  2 Albania   2000         6.63             7                        12
##  3 Albania   2010         6.60             2                        18
##  4 Albania   2020         7.07             2                        13
##  5 Algeria   1980         6.83             3                         1
##  6 Algeria   1990         2.54             0                         1
##  7 Algeria   2000         0.333            0                         0
##  8 Algeria   2010         7.83             4                        11
##  9 Algeria   2020         9.4              4                        11
## 10 Angola    2010         0                0                         0
## # ℹ 531 more rows
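
Summed awards depend on how many editions fall into a decade bin (the 1950 bin contains only 1959, and the 2020 bin at most five editions) and on how often a country took part. One possible alternative, sketched here in base R on invented numbers, is to divide by the number of participations. The names `toy`, `n_years` and `awards_per_year` are hypothetical and not part of the analysis above.

```r
# Invented participation records for two countries
toy <- data.frame(
  country      = c("A", "A", "A", "B", "B"),
  year         = c(2018, 2019, 2021, 2015, 2022),
  total_awards = c(4, 6, 5, 2, 3)
)
toy$decade <- floor(toy$year / 10) * 10

# Sum of awards and number of participations per country and decade
sums   <- aggregate(total_awards ~ country + decade, data = toy, FUN = sum)
counts <- aggregate(year ~ country + decade, data = toy, FUN = length)
names(counts)[names(counts) == "year"] <- "n_years"

# Awards per participation, comparable across decades of different length
per_decade <- merge(sums, counts, by = c("country", "decade"))
per_decade$awards_per_year <- per_decade$total_awards / per_decade$n_years
per_decade
```

With this normalization, a country with one strong year in a short decade is no longer ranked below a country with several mediocre years in a full decade.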

For clustering, I use only the numerical columns and standardize them so that they are on the same scale.

cluster_data <- pfm_aggregate[3:5] #numerical columns only
scaled_data <- scale(cluster_data)
head(scaled_data)
##      average_score total_awards awards_honorable_mentions
## [1,]    -1.0959967   -1.0280125                -0.6707646
## [2,]    -0.7592170   -0.6533173                 0.6435550
## [3,]    -0.7622707   -0.9209567                 1.4321468
## [4,]    -0.7138477   -0.9209567                 0.7749870
## [5,]    -0.7382774   -0.8674288                -0.8021965
## [6,]    -1.1876078   -1.0280125                -0.8021965
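
By default, `scale()` subtracts each column's mean and divides by its sample standard deviation, i.e. z = (x - mean(x)) / sd(x). A minimal sketch on invented numbers (the matrix `x` is illustrative only) confirms that the manual computation matches:

```r
# Small invented matrix with two of the feature names used above
x <- matrix(c(3.4, 6.6, 6.6, 7.1, 0, 7, 2, 2), ncol = 2,
            dimnames = list(NULL, c("average_score", "total_awards")))

# Manual z-scores: centre on the column mean, divide by the column sd
centred  <- sweep(x, 2, colMeans(x), "-")
z_manual <- sweep(centred, 2, apply(x, 2, sd), "/")

# Same result from scale(); attributes (centre/scale) are ignored in the comparison
z_scale <- scale(x)
same <- isTRUE(all.equal(z_manual, z_scale, check.attributes = FALSE))
same
```

Standardizing matters here because `total_awards` and `awards_honorable_mentions` are sums over a decade while `average_score` is a mean, so the raw columns have very different ranges and would otherwise dominate Euclidean distances unequally.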

The data points used for clustering are visualized on a 3D scatter plot.

4 Clustering

4.1 Is this data cluster-able?

Before applying any clustering method, I check whether the data has a tendency to cluster by calculating the Hopkins statistic.

get_clust_tendency(scaled_data, n = nrow(scaled_data) - 1, graph = TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.9028108
## 
## $plot

The heat map shows clearly structured data with distinct lines and blocks, and the Hopkins statistic of 0.903 is close to 1. I therefore conclude that the dataset is highly clusterable.
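
To make the statistic less of a black box, here is a rough base R sketch of the idea behind it, under the convention that values near 1 indicate clustered data and values near 0.5 indicate uniformly scattered data. It is not the implementation used by `get_clust_tendency()`; the function name `hopkins_sketch` and the argument `m` are assumptions made for illustration.

```r
# Hopkins-style statistic: compares nearest-neighbour distances of uniformly
# placed artificial points (u) with those of sampled real points (w).
hopkins_sketch <- function(x, m = floor(nrow(x) / 10), seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  n <- nrow(x)

  # m real points sampled without replacement
  idx <- sample.int(n, m)

  # m artificial points drawn uniformly from the bounding box of the data
  rng <- apply(x, 2, range)
  art <- sapply(seq_len(ncol(x)), function(j) runif(m, rng[1, j], rng[2, j]))
  art <- matrix(art, nrow = m)

  # Distance from each row of `pts` to its nearest data point; `exclude`
  # holds the data row to skip so a real point is not matched with itself
  nn_dist <- function(pts, exclude = NULL) {
    sapply(seq_len(nrow(pts)), function(i) {
      d <- sqrt(colSums((t(x) - pts[i, ])^2))
      if (!is.null(exclude)) d <- d[-exclude[i]]
      min(d)
    })
  }

  u <- nn_dist(art)
  w <- nn_dist(x[idx, , drop = FALSE], exclude = idx)
  sum(u) / (sum(u) + sum(w))
}

# Two tight, well-separated blobs should score close to 1
set.seed(1)
blobs <- rbind(matrix(rnorm(100, 0, 0.1), ncol = 2),
               matrix(rnorm(100, 5, 0.1), ncol = 2))
h_blobs <- hopkins_sketch(blobs, m = 10)
h_blobs
```

In clustered data the real points sit close to their neighbours while the artificial points often land in empty space, so `sum(u)` dominates and the ratio approaches 1.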

4.2 The optimal number of clusters

optnb<-NbClust(scaled_data, distance="euclidean", min.nc=2, max.nc=10, method="complete", index="ch")
optnb
## $All.index
##        2        3        4        5        6        7        8        9 
## 177.4881 261.5802 544.4936 555.5426 510.3006 464.4564 455.7308 487.0920 
##       10 
## 468.3016 
## 
## $Best.nc
## Number_clusters     Value_Index 
##          5.0000        555.5426 
## 
## $Best.partition
##   [1] 1 2 2 2 1 1 1 2 2 1 1 3 4 2 2 3 2 2 3 3 4 4 5 3 5 3 3 2 2 2 1 2 2 2 1 1 2
##  [38] 2 3 5 4 3 1 1 3 2 2 2 2 1 1 1 1 1 1 2 1 2 2 3 1 1 1 3 2 4 4 3 1 3 3 5 5 5
##  [75] 4 3 1 1 1 1 1 3 4 4 5 3 1 1 2 1 3 3 4 2 2 3 1 2 2 3 2 4 3 1 3 2 1 1 1 1 1
## [112] 2 2 2 3 4 2 3 5 3 5 3 3 3 5 1 2 2 2 1 1 1 1 2 1 1 1 1 2 1 1 2 2 2 1 3 3 2
## [149] 2 2 2 3 3 3 4 4 4 3 1 2 4 2 3 5 5 5 3 3 5 5 4 4 3 1 1 1 3 2 2 2 3 1 1 1 1
## [186] 1 1 1 3 4 4 4 3 5 5 5 5 5 4 3 1 1 1 2 2 3 5 4 4 3 1 1 2 5 3 1 1 1 1 2 2 2
## [223] 3 5 5 5 3 3 3 4 4 4 3 3 1 4 4 4 3 1 1 1 5 5 5 3 2 4 4 3 1 1 2 2 1 1 1 1 1
## [260] 2 2 2 1 3 2 2 2 1 1 1 1 2 2 2 1 1 2 1 1 2 2 2 2 1 1 2 2 3 1 1 1 2 4 4 3 3
## [297] 1 3 2 4 2 3 1 1 1 3 2 2 2 2 1 1 1 1 1 1 3 3 2 2 4 3 1 3 2 2 2 1 2 1 1 2 1
## [334] 3 2 2 2 1 3 2 2 2 1 1 2 1 1 1 1 1 1 1 1 2 2 3 5 5 5 5 1 1 4 4 3 1 1 1 4 3
## [371] 3 3 3 4 4 4 3 1 1 2 2 2 1 1 1 1 3 5 5 5 5 1 4 2 2 5 5 5 5 5 5 3 5 5 5 3 1
## [408] 1 4 3 3 4 3 3 3 4 4 5 3 3 4 4 3 2 2 2 2 2 2 2 2 1 2 2 2 3 1 2 2 2 3 3 3 3
## [445] 2 2 2 1 2 2 3 1 2 2 5 5 5 3 1 2 2 1 1 1 2 4 5 3 1 2 1 1 1 1 1 2 2 1 1 2 2
## [482] 2 1 3 4 4 5 3 1 1 4 5 5 3 5 5 5 3 1 1 1 3 5 5 5 5 4 3 5 5 5 5 5 5 1 1 1 2
## [519] 2 1 2 2 2 1 1 1 1 1 3 5 5 5 5 3 3 3 3 4 3 1 1

According to NbClust with the Calinski-Harabasz index and complete linkage, the optimal number of clusters for my data set is 5. However, this result should be verified with other methods, so I also check how many clusters the elbow, silhouette and AIC criteria suggest.

opteb <- Optimal_Clusters_KMeans(scaled_data, max_clusters=10, plot_clusters = TRUE)

optsh <- Optimal_Clusters_KMeans(scaled_data, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

optaic <- Optimal_Clusters_KMeans(scaled_data, max_clusters=10, plot_clusters=TRUE, criterion="AIC")

These plots agree on 3 as an appropriate number of clusters: it achieves the highest silhouette score, and additional clusters capture little further structure in the data.

Since my data set is not large, I compare the partitions produced by three clustering methods: K-means, PAM and hierarchical clustering.
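
The elbow criterion can also be reproduced with base R alone. The sketch below uses invented blob data (`toy` is not the IMO data) and plain `kmeans()` rather than `Optimal_Clusters_KMeans()`; it computes the total within-cluster sum of squares for each k and the share of the k = 1 value that each k removes.

```r
# Three well-separated invented blobs in three dimensions
set.seed(123)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 3),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 3)
)

# Total within-cluster sum of squares for k = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Share of the k = 1 sum of squares removed by each k
explained <- 1 - wss / wss[1]
round(explained, 3)
```

The elbow is the k after which `explained` stops improving noticeably; plotting `wss` against `ks` gives the usual elbow chart.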

4.3 K-Means

4.4 PAM

4.5 Hierarchical clustering

For the agglomerative approach, I first compute the agglomerative coefficient for four linkage methods: “average”, “single”, “complete” and “ward”.

##    single   average  complete      ward 
## 0.7889226 0.9548715 0.9784068 0.9965703

Ward has the highest coefficient of the four methods, so I visualize it on a dendrogram.

For the divisive approach, a dendrogram is plotted with Diana.

For each dendrogram, a scatter plot is used to visualize the clusters.
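
A comparison of this kind can be sketched with `agnes()` from the `cluster` package, whose `ac` component holds the agglomerative coefficient. That this is the function behind the table above is an assumption, and `toy` is invented data.

```r
library(cluster)  # provides agnes(); assumed to be the package behind the table above

# Two invented blobs in two dimensions
set.seed(123)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2)
)

# Agglomerative coefficient for each linkage method (closer to 1 = stronger structure)
linkages <- c("single", "average", "complete", "ward")
ac <- sapply(linkages, function(m) agnes(toy, method = m)$ac)
ac

# Linkage with the highest coefficient
best_linkage <- names(which.max(ac))
best_linkage
```

The coefficient tends to grow with the number of observations, so it is best used to compare linkage methods on the same data rather than across data sets of different sizes.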

Given the differences in how these methods calculate clusters, the final clusters are slightly different. In the next part, I will compute the silhouette score over all methods to see which method performs the best.

4.6 Performance comparison between methods

## K-means average silhouette width: 0.4907977
## PAM average silhouette width: 0.4819316
## Ward average silhouette width: 0.4697884
## Diana average silhouette width: 0.4934603
##   cluster size ave.sil.width
## 1       1  188          0.48
## 2       2  143          0.40
## 3       3  210          0.56
##   cluster size ave.sil.width
## 1       1  183          0.62
## 2       2  161          0.36
## 3       3  197          0.46
##   cluster size ave.sil.width
## 1       1  203          0.59
## 2       2  108          0.48
## 3       3  230          0.36
##   cluster size ave.sil.width
## 1       1  216          0.56
## 2       2  125          0.45
## 3       3  200          0.45

The four per-cluster tables above follow the same order as the averages: K-means, PAM, Ward and Diana. The silhouette scores for K-means, PAM and Diana are relatively similar. The Ward method has the lowest average silhouette width (0.470), implying slightly less effective clustering than the other methods.

Diana shows marginally better clustering quality than the rest (0.493 versus 0.491 for K-means). However, given the efficiency and scalability of K-means and the minimal difference in silhouette score, K-means is the preferred choice for further clustering and analysis. It offers a balance between quality and practicality, producing comparable clusters while remaining computationally efficient.
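
A comparison of average silhouette widths across the four methods could look like the following sketch. It uses the `cluster` package and invented blob data; the helper `avg_sil` and the use of `hclust(..., method = "ward.D2")` for Ward linkage are assumptions, not the code behind the numbers above.

```r
library(cluster)  # silhouette(), pam(), diana()

# Three invented, well-separated blobs in two dimensions
set.seed(123)
toy <- rbind(
  cbind(rnorm(30, 0, 0.4), rnorm(30, 0, 0.4)),
  cbind(rnorm(30, 4, 0.4), rnorm(30, 4, 0.4)),
  cbind(rnorm(30, 8, 0.4), rnorm(30, 0, 0.4))
)
d <- dist(toy)  # Euclidean distances shared by all methods

# Average silhouette width for a vector of cluster labels
avg_sil <- function(labels) mean(silhouette(labels, d)[, "sil_width"])

# Three-cluster solutions from each method
labels <- list(
  kmeans = kmeans(toy, centers = 3, nstart = 25)$cluster,
  pam    = pam(toy, k = 3)$clustering,
  ward   = cutree(hclust(d, method = "ward.D2"), k = 3),
  diana  = cutree(as.hclust(diana(toy)), k = 3)
)

sil <- sapply(labels, avg_sil)
round(sil, 3)
```

Computing every silhouette on the same distance object keeps the four averages directly comparable.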

5 K-means cluster evaluation

NbClust recommended 5 clusters in the previous stage, whereas the elbow, silhouette and AIC criteria pointed to 3. I therefore compute the Calinski-Harabasz index for K-means with 3 and with 5 clusters to compare the quality of the two solutions.

## Calinski-Harabasz index for 3 clusters: 665.4
## Calinski-Harabasz index for 5 clusters: 598.23

For the Calinski-Harabasz index, higher values are better. Therefore, 3 clusters give a better partition than 5.

In addition, the shadow statistic is used to evaluate clustering quality. It is closely related to the silhouette score but gives additional insight into cluster cohesion and separation by comparing each data point's distance to its own cluster centroid with its distance to the second-closest centroid.
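
The index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH(k) = [B / (k - 1)] / [W / (n - k)]. A minimal base R sketch on invented data follows; the helper `ch_index` is a hypothetical name and not the function used to produce the values above.

```r
# Calinski-Harabasz index from a kmeans() fit on n observations
ch_index <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

# Three invented, well-separated blobs in two dimensions
set.seed(123)
toy <- rbind(
  cbind(rnorm(30, 0, 0.4), rnorm(30, 0, 0.4)),
  cbind(rnorm(30, 4, 0.4), rnorm(30, 4, 0.4)),
  cbind(rnorm(30, 8, 0.4), rnorm(30, 0, 0.4))
)

# Compare the 3-cluster and 5-cluster K-means solutions
ch <- sapply(c(3, 5), function(k) {
  ch_index(kmeans(toy, centers = k, nstart = 25), nrow(toy))
})
names(ch) <- c("k=3", "k=5")
ch
```

The division by k - 1 penalizes extra clusters, which is why splitting already compact groups lowers the index even though the within-cluster sum of squares keeps shrinking.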

##         1         2         3 
## 0.5470393 0.5913554 0.4825425

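In contrast to silhouette values, which directly measure how well separated the clusters are, shadow values describe the geometric arrangement of the points relative to the centroids. The shadow value of a point is commonly defined as twice its distance to the closest centroid divided by the sum of its distances to the closest and second-closest centroids, so values near 0 mark points close to their own centroid and values near 1 mark points almost halfway between two centroids. All three clusters average roughly 0.5 to 0.6, which suggests moderate rather than sharp separation, in line with the average silhouette widths of just under 0.5.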
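
A shadow value can be computed by hand as 2 * d1 / (d1 + d2), where d1 and d2 are a point's distances to its closest and second-closest centroids. The base R sketch below assumes this definition and Euclidean distance; `shadow_values` is a hypothetical helper and `toy` is invented data, so it is not the code that produced the numbers above.

```r
# Shadow value per point: 0 = on its centroid, 1 = equidistant to two centroids
shadow_values <- function(x, centers) {
  x <- as.matrix(x)
  # Distance from every point (rows) to every centroid (columns)
  d <- apply(centers, 1, function(ctr) sqrt(rowSums(sweep(x, 2, ctr)^2)))
  # Sort each row so column 1 is the closest and column 2 the second-closest
  d_sorted <- t(apply(d, 1, sort))
  2 * d_sorted[, 1] / (d_sorted[, 1] + d_sorted[, 2])
}

# Three invented blobs clustered with K-means
set.seed(123)
toy <- rbind(
  cbind(rnorm(30, 0, 0.4), rnorm(30, 0, 0.4)),
  cbind(rnorm(30, 4, 0.4), rnorm(30, 4, 0.4)),
  cbind(rnorm(30, 8, 0.4), rnorm(30, 0, 0.4))
)
km <- kmeans(toy, centers = 3, nstart = 25)

# Mean shadow value per cluster (lower = more compact relative to neighbours)
s <- shadow_values(toy, km$centers)
cluster_shadow <- tapply(s, km$cluster, mean)
cluster_shadow
```

Because K-means assigns each point to its nearest centroid, the closest centroid is always the point's own, and the value stays between 0 and 1.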

6 Data Visualization

The K-means cluster assignments are applied to the 3D scatter plot of the data points.

The plot shows three recognizable clusters. Cluster 1 (red) contains the country-decades with both high scores and a high number of awards. However, they have few honorable mentions relative to their overall performance.

Cluster 2 (yellow) maintains average scores and a reasonable number of total awards. Interestingly, despite these average scores, this cluster is characterized by a relatively high number of honorable mentions.

The 3rd cluster (green) includes countries with the lowest performance in terms of scores and total awards, accompanied by low honorable mentions.

The violin plots below show the distribution of each feature within each cluster.

These cluster insights can be critical for understanding the strengths and weaknesses of different Olympiad teams. Countries that fall into cluster 3 could benefit from increased support and resources to improve their performance in future competitions.

To give a more vivid picture of the performance of the countries that participated in the IMO, I plotted a map that shows global participation trends and how countries were grouped into clusters (1, 2 or 3) across the decades. NA values mark countries that have never participated in the IMO.

In the 1960s, relatively few countries participated, mostly the Soviet Union and Eastern European countries. As the number of participants in the Olympiad grew, cluster membership began to vary noticeably from decade to decade. Today there are participants from 5 continents.

Cluster 1 was dominant in the earlier decades, representing countries that excelled in both awards and scores. Most of the countries in this cluster are from Europe, America and Australia, typically developed nations with long-standing traditions in competitive mathematics.

There is a noticeable trend among the former Soviet Union countries during the 2000s, where their performance declined significantly, moving them from the top-performing cluster to the lowest-performing one. However, this trend was temporary, as these countries regained their standing in the well-performing cluster during the 2010s and 2020s, showcasing a strong recovery in their results.

Cluster 2 exhibits moderate performance across decades. However, their geographical distribution is more fragmented, suggesting that moderate performers are spread across different regions without a strong concentration.

Countries in cluster 3 are predominantly from Africa and other regions with relatively lower scores and fewer awards. A significant trend is observed between the 1980s and 1990s, when some participants from South America improved their performance and moved from cluster 3 to cluster 2. This suggests developments in mathematics training and education in those countries during that time.

It is also worth mentioning that although China and Australia joined the contest relatively late, in the 1980s, they have maintained consistent performance over the decades. Their strong and stable results are reflected in their consistent appearance in cluster 1, which represents countries with high performance in terms of scores and awards.
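
Transitions like these can be listed programmatically rather than read off the map. The sketch below uses an invented `assignments` table with hypothetical columns `country`, `decade` and `cluster`; it is not the report's data.

```r
# Invented cluster assignments per country and decade
assignments <- data.frame(
  country = c("A", "A", "A", "B", "B", "C", "C"),
  decade  = c(1980, 1990, 2000, 1990, 2000, 1980, 1990),
  cluster = c(3, 2, 2, 1, 1, 3, 3)
)

# Countries (rows) by decade (columns); NA marks decades without participation
wide <- tapply(assignments$cluster, assignments[c("country", "decade")],
               FUN = function(v) v[1])
wide

# Cluster in the previous recorded decade for the same country
ord  <- assignments[order(assignments$country, assignments$decade), ]
prev <- ave(ord$cluster, ord$country, FUN = function(v) c(NA, head(v, -1)))

# Keep only country-decades whose cluster differs from the previous one
changed    <- !is.na(prev) & prev != ord$cluster
moves      <- ord[changed, ]
moves$from <- prev[changed]
moves
```

Note that "previous" here means the previous decade in which the country appears, which is not necessarily the adjacent decade if a country skipped one.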

In conclusion, the earlier decades are dominated by high-performing countries, reflecting the initial stronghold of a few nations (Hungary, Romania, Russia), while later decades show increased diversity, with more countries falling into the moderate- or low-performance clusters.

7 References

The official website of the International Mathematical Olympiad.